Part 2
Laura Silvana Alvarez, Florencia Luque and Simon Schmetz
The following project documentation was written as a work assignment for the module "Multivariate Analysis" of the Master in Statistics for Data Science at the Universidad Carlos III de Madrid. It contains the multivariate analysis of a Kaggle dataset on Sleep Health and Lifestyle (https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset). The work is split into two parts. In the first part, an exploratory data analysis is performed, some data preprocessing steps are taken and a Principal Component Analysis (PCA) is carried out. In the second part, distance-based clustering in the form of K-Means is applied to identify clusters emerging from the PCA. Additionally, as an alternative to PCA, Multi Dimensional Scaling is applied as a second dimension reduction technique and compared to PCA in its effectiveness at reducing dimensions.
Kmeans
In the first step, we will compare different types of clustering using the K-Means algorithm. The first approach focuses solely on numeric variables, comparing two scenarios: one using the raw data (excluding variables with high correlations) and another using the PCA components derived in the first part of the analysis.
For the second approach, K-Means will be applied to mixed data, combining both numeric and categorical variables.
Kmeans with PCA data
If we compare the graphs, there is stabilization between 5 and 7 clusters in the Silhouette method. Similarly, in the Elbow method, the slope becomes less steep around 5 and 7 clusters.
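The two diagnostics mentioned above could be computed along the following lines. This is a minimal sketch using scikit-learn: `X_demo` is synthetic stand-in data rather than the project's PCA scores, and the helper name `scan_k` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def scan_k(X, k_values, random_state=0):
    """Return {k: (inertia, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        # inertia_ is the within-cluster sum of squares (WCSS) used by the Elbow method
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


# Assumption: synthetic data with four well-separated groups in three dimensions,
# standing in for the first three principal components.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [6, 0, 0], [0, 6, 0], [0, 0, 6]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(75, 3)) for c in centers])

scan = scan_k(X_demo, range(2, 9))
for k, (inertia, sil) in scan.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
```

Plotting inertia against k gives the Elbow curve, and plotting the silhouette score against k gives the Silhouette curve. On real data the two rarely agree on a single k, which is why the text weighs them against each other.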
There is some overlap between clusters 3 and 1 in the PC1 vs PC2 plot. The same occurs between clusters 1 and 0 in the PC1 vs PC3 plot. The other clusters are all well separated in the first factorial plane.
Inertia (WCSS): 469.44182341248046 Silhouette Score: 0.5704171764668877
With 5 clusters, the separation between the clusters appears clearer. The silhouette score is also close to or above 0.5, as the graphs show.
Comparing the results, the best number of clusters for the PCA data would be 7, but we decided to keep 5, because that many clusters is hard to justify when the sample is not that large.
The main characteristics of the clusters are given in the following tables. For the categorical variables, the table shows the mode within each cluster, and the last column gives the percentage of the cluster that has a sleep disorder.
Clusters 2 and 4 are dominated by overweight female nurses. Although cluster 4 perceives its quality of sleep as good, both clusters have the highest percentages of sleep disorders (over 90%).
Cluster 3 has the lowest proportion of sleep disorders; most of its members are male lawyers with a normal BMI.
In general, it can be concluded that perceived sleep quality is not directly related to having a sleep disorder.
| cluster | gender | occupation | quality_of_sleep | stress_level | bmi_category | sleep_disorder | sleep_disorder_percentage |
|---|---|---|---|---|---|---|---|
| 0 | Female | Engineer | 9 | 3 | Normal | 0 | 40.00 |
| 1 | Male | Salesperson | 6 | 7 | Overweight | 0 | 48.78 |
| 2 | Female | Nurse | 6 | 8 | Overweight | 1 | 93.75 |
| 3 | Male | Lawyer | 8 | 5 | Normal | 0 | 6.34 |
| 4 | Female | Nurse | 9 | 3 | Overweight | 1 | 90.91 |
| cluster | age | sleep_duration | physical_activity_level | heart_rate | daily_steps | blood_pressure_systolic | blood_pressure_diastolic |
|---|---|---|---|---|---|---|---|
| 0 | 47.646154 | 7.558462 | 38.538462 | 65.000000 | 5415.384615 | 129.153846 | 84.153846 |
| 1 | 38.731707 | 6.320732 | 39.560976 | 72.146341 | 5734.146341 | 127.256098 | 82.865854 |
| 2 | 49.750000 | 6.065625 | 90.000000 | 75.000000 | 10000.000000 | 140.000000 | 95.000000 |
| 3 | 37.070423 | 7.483803 | 71.035211 | 69.112676 | 7754.929577 | 122.880282 | 80.683099 |
| 4 | 58.030303 | 8.093939 | 75.000000 | 68.242424 | 6878.787879 | 140.000000 | 95.000000 |
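Tables of this shape (mode for categorical variables, mean for numeric variables, and the share of members with a sleep disorder) could be built with a pandas `groupby`. This is a sketch under assumptions: the helper name `cluster_profile` and the toy data are hypothetical, and `sleep_disorder` is assumed to be coded as 0/1.

```python
import pandas as pd


def cluster_profile(df, labels, categorical, numeric, disorder_col):
    """Per cluster: mode of categorical columns, mean of numeric ones, % with a disorder."""
    grouped = df.assign(cluster=labels).groupby("cluster")
    # s.mode() can return several values on ties; the first one is kept here
    modes = grouped[categorical].agg(lambda s: s.mode().iloc[0])
    means = grouped[numeric].mean()
    # Mean of a 0/1 column times 100 gives the percentage of members with a disorder
    pct = (grouped[disorder_col].mean() * 100).round(2).rename("sleep_disorder_percentage")
    return modes.join(pct), means


# Toy data (assumption), using column names that appear in the tables above
demo = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "bmi_category": ["Overweight", "Overweight", "Normal", "Normal", "Overweight", "Overweight"],
    "sleep_disorder": [1, 1, 0, 0, 1, 0],
    "age": [50, 52, 35, 37, 36, 49],
})
labels = [0, 0, 1, 1, 1, 0]  # hypothetical K-Means labels

cat_table, num_table = cluster_profile(
    demo, labels, ["gender", "bmi_category", "sleep_disorder"], ["age"], "sleep_disorder"
)
print(cat_table)
print(num_table)
```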
Kmeans with raw data
For this, we need data without strong correlations. The variables related to blood pressure have a high correlation of almost 0.97. To avoid losing any information, we decided to create a new variable called pulse pressure. Pulse pressure is calculated as the difference between systolic and diastolic blood pressure.
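The pulse pressure feature described above could be derived as follows. This is a small sketch with made-up values: the column name `pulse_pressure` is an assumption, and dropping both original blood pressure columns afterwards is one possible choice rather than something the text confirms.

```python
import pandas as pd

# Toy values (assumption); column names follow the tables in this report
bp = pd.DataFrame({
    "blood_pressure_systolic": [120, 140, 130],
    "blood_pressure_diastolic": [80, 95, 85],
})

# Pulse pressure = systolic minus diastolic blood pressure
bp["pulse_pressure"] = bp["blood_pressure_systolic"] - bp["blood_pressure_diastolic"]

# Assumption: both highly correlated originals are replaced by the derived variable
features = bp.drop(columns=["blood_pressure_systolic", "blood_pressure_diastolic"])
print(features)
```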
If you compare the two graphs, there is stabilization around 4 clusters in the Silhouette Score graph and a noticeable decrease in the slope (tangent) in the Elbow Method graph. However, using 5 clusters appears slightly better, as it balances both metrics.
Inertia (WCSS): 694.6414160666794 Silhouette Score: 0.57221538910709
There is a larger decrease in WCSS and a slight improvement in the Silhouette Score when choosing 5 clusters. Additionally, using PCA to visualize the data reveals better separation of the clusters compared to 4 clusters. Choosing 5 clusters reduces overlaps between the groups, providing a more distinct and interpretable clustering structure.
The description of the clusters is as follows.
| cluster | gender | occupation | quality_of_sleep | stress_level | bmi_category | sleep_disorder | sleep_disorder_percentage |
|---|---|---|---|---|---|---|---|
| 0 | Male | Lawyer | 8 | 5 | Normal | 0 | 37.89 |
| 1 | Male | Doctor | 8 | 4 | Normal | 0 | 4.71 |
| 2 | Male | Salesperson | 6 | 7 | Overweight | 1 | 62.11 |
| 3 | Female | Nurse | 6 | 8 | Overweight | 1 | 88.24 |
| 4 | Female | Engineer | 9 | 3 | Normal | 0 | 40.00 |
| cluster | age | sleep_duration | physical_activity_level | heart_rate | daily_steps | blood_pressure_systolic | blood_pressure_diastolic |
|---|---|---|---|---|---|---|---|
| 0 | 47.031579 | 7.735789 | 76.368421 | 68.715789 | 7618.947368 | 133.084211 | 88.178947 |
| 1 | 33.882353 | 7.362353 | 63.788235 | 69.176471 | 7510.588235 | 117.894118 | 77.847059 |
| 2 | 38.778947 | 6.409474 | 41.557895 | 74.305263 | 5213.684211 | 129.368421 | 84.157895 |
| 3 | 48.470588 | 6.073529 | 88.235294 | 75.000000 | 10000.000000 | 139.117647 | 94.117647 |
| 4 | 47.646154 | 7.558462 | 38.538462 | 65.000000 | 5415.384615 | 129.153846 | 84.153846 |
Kmeans with mixed data
For this, we will also use the pulse pressure variable and exclude the sleep disorder variable. This is because the sleep disorder variable is the target variable that this dataset aims to explain.
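The text does not name the algorithm used to cluster mixed data. One common option is k-prototypes, which adds a weighted count of categorical mismatches to the squared Euclidean distance on the numeric variables. The sketch below shows only the assignment step of that idea; the function name, the weight `gamma` and the toy data are assumptions, not the project's implementation.

```python
import numpy as np


def kprototypes_assign(num, cat, proto_num, proto_cat, gamma=1.0):
    """Assign each row to the prototype with the lowest mixed dissimilarity."""
    num, proto_num = np.asarray(num, dtype=float), np.asarray(proto_num, dtype=float)
    cat, proto_cat = np.asarray(cat), np.asarray(proto_cat)
    # Squared Euclidean distance on the (ideally standardized) numeric part
    d_num = ((num[:, None, :] - proto_num[None, :, :]) ** 2).sum(axis=2)
    # Number of mismatching categories on the categorical part
    d_cat = (cat[:, None, :] != proto_cat[None, :, :]).sum(axis=2)
    cost = d_num + gamma * d_cat
    labels = cost.argmin(axis=1)
    total_cost = float(cost[np.arange(len(labels)), labels].sum())
    return labels, total_cost


# Toy data (assumption): one standardized numeric feature and one categorical feature
num = [[-1.0], [-0.9], [1.0], [1.1]]
cat = [["Nurse"], ["Nurse"], ["Doctor"], ["Doctor"]]
labels, total_cost = kprototypes_assign(num, cat, [[-1.0], [1.0]], [["Nurse"], ["Doctor"]])
print(labels, total_cost)
```

A full algorithm would alternate this step with updating the prototypes (mean for numeric columns, mode for categorical ones) until the assignments stop changing.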
Now we have the associations between numerical and categorical variables, which reveal some interesting relationships:
- Age is highly associated with occupation, quality of sleep and stress level.
- The heart rate is associated with quality of sleep, stress level and BMI category.
- Other important relations are gender with occupation and sleep disorder with bmi category.
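The text does not state which association measure was used. For a numeric variable against a categorical one, a common choice is the correlation ratio (eta), which is the square root of the share of variance explained by group membership. A minimal sketch with made-up data follows; the function name is hypothetical.

```python
import numpy as np


def correlation_ratio(categories, values):
    """Eta in [0, 1]: sqrt(between-group sum of squares / total sum of squares)."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # a constant variable cannot be associated with anything
    ss_between = sum(
        values[categories == c].size * (values[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    return float(np.sqrt(ss_between / ss_total))


# Toy example (assumption): age fully determined by occupation, then unrelated to it
occupation = ["Nurse", "Nurse", "Doctor", "Doctor"]
eta_strong = correlation_ratio(occupation, [50, 50, 30, 30])
eta_none = correlation_ratio(occupation, [50, 30, 50, 30])
print(eta_strong, eta_none)
```

For pairs of categorical variables, such as gender with occupation, a measure like Cramér's V would play the same role.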
In this case the Elbow method does not show a clear point where the trend of the line changes, but the Silhouette method suggests taking 6 or 8 clusters.
Inertia (WCSS): 654.0051973653807 Silhouette Score: 0.5304769553918965
Inertia (WCSS): 848.3962463374016 Silhouette Score: 0.45668915344071426
Comparing the silhouette score and the cost, the best number of clusters for the mixed K-Means is 8. However, given the size of the dataset, we consider 8 clusters too many, so we decided to keep 6.
What we saw before about the positive association of BMI category and occupation with having a sleep disorder is reflected in clusters 0 and 1, where being overweight and working as a Nurse or Salesperson increase the proportion of sleep disorders.
From cluster 4 and the association matrix we can infer that the BMI category is the main driver of having a sleep disorder.
| cluster | gender | occupation | quality_of_sleep | stress_level | bmi_category | sleep_disorder | sleep_disorder_percentage |
|---|---|---|---|---|---|---|---|
| 0 | Male | Salesperson | 7 | 7 | Overweight | 1 | 79.22 |
| 1 | Female | Nurse | 6 | 8 | Overweight | 1 | 88.24 |
| 2 | Female | Nurse | 9 | 3 | Overweight | 0 | 47.69 |
| 3 | Male | Doctor | 6 | 8 | Normal | 0 | 44.23 |
| 4 | Male | Doctor | 8 | 4 | Normal | 0 | 4.65 |
| 5 | Male | Lawyer | 8 | 5 | Normal | 0 | 10.00 |
| cluster | age | sleep_duration | physical_activity_level | heart_rate | daily_steps | blood_pressure_systolic | blood_pressure_diastolic |
|---|---|---|---|---|---|---|---|
| 0 | 43.649351 | 6.592208 | 46.168831 | 68.987013 | 5905.194805 | 131.051948 | 86.181818 |
| 1 | 48.470588 | 6.073529 | 88.235294 | 75.000000 | 10000.000000 | 139.117647 | 94.117647 |
| 2 | 55.446154 | 8.256923 | 52.846154 | 66.646154 | 5953.846154 | 132.615385 | 87.615385 |
| 3 | 34.269231 | 6.348077 | 38.326923 | 76.230769 | 4588.461538 | 129.211538 | 83.673077 |
| 4 | 33.988372 | 7.365116 | 63.918605 | 69.162791 | 7502.325581 | 117.941860 | 77.872093 |
| 5 | 40.983333 | 7.551667 | 77.500000 | 68.933333 | 8066.666667 | 129.600000 | 84.666667 |
Multi Dimensional Scaling
As an alternative to Principal Component Analysis, Multi Dimensional Scaling (MDS) offers a distance-based dimension reduction method. The input to MDS is a distance (or dissimilarity) matrix generated by a chosen distance measure. This distance has to fulfill the following properties:
- Symmetry (of distance between two points, A->B = B->A)
- Non-Negativity (distance between two points must be non-negative)
- Identity of Indiscernibles (distance A->B is only zero if A=B)
- Triangle Inequality (A->C <= A->B + B->C)
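The four properties listed above can be checked numerically on any candidate distance matrix. The sketch below is illustrative only; the function name and the toy points are assumptions. The identity check treats every off-diagonal zero as a violation, so it assumes that all observations are distinct.

```python
import numpy as np


def check_metric_properties(D, tol=1e-9):
    """Check the four metric properties on a square distance matrix D."""
    D = np.asarray(D, dtype=float)
    off_diag = ~np.eye(len(D), dtype=bool)
    return {
        "symmetry": bool(np.allclose(D, D.T, atol=tol)),
        "non_negativity": bool((D >= -tol).all()),
        # Zero on the diagonal, strictly positive elsewhere (assumes distinct points)
        "identity": bool(np.allclose(np.diag(D), 0, atol=tol) and (D[off_diag] > tol).all()),
        # D[i, k] <= D[i, j] + D[j, k] for every triple (i, j, k)
        "triangle": bool((D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol).all()),
    }


# Toy points (assumption): Euclidean distances satisfy all four properties,
# squared Euclidean distances violate the triangle inequality.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D_euclid = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

print(check_metric_properties(D_euclid))
print(check_metric_properties(D_euclid ** 2))
```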
Distance Matrices
In the following, we use the Mahalanobis distance and the Gower dissimilarity to set up two distance matrices, one of which will then be used to apply MDS.
Plotting the corresponding distance matrices gives a subjective impression of the differences between them, but it does not allow any deeper conclusions at the level of individual data points.
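Both matrices can be computed with plain NumPy. This is a sketch under assumptions: the function names and toy data are hypothetical, the Mahalanobis version uses the sample covariance of the numeric columns, and the Gower version gives every variable equal weight (range-scaled absolute difference for numeric columns, simple mismatch for categorical ones).

```python
import numpy as np


def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances between the rows of a numeric array X."""
    X = np.asarray(X, dtype=float)
    # Pseudo-inverse guards against a singular covariance matrix
    VI = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]
    sq = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative rounding errors


def gower_matrix(num, cat):
    """Pairwise Gower dissimilarities for numeric and categorical columns."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero dissimilarity
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)


# Toy data (assumption)
rng = np.random.default_rng(0)
X_num = rng.normal(size=(20, 3))
M = mahalanobis_matrix(X_num)

num = [[30, 6.0], [40, 7.0], [50, 8.0]]
cat = [["Male", "Normal"], ["Female", "Normal"], ["Female", "Overweight"]]
G = gower_matrix(num, cat)
print(G)
```

The Mahalanobis distance only covers the numeric variables, while the Gower dissimilarity also handles the categorical ones, which is the main practical difference between the two matrices.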